This code seeks to explain how do organoids from patients reflect the generality of OV patients. We do clustering of patients and organoids and see where organoids fall – based on signatures, and also based on raw CN profiles.

To do: Create the new britroc OV exposures using the new code that Ruben has provided and the new absolute copy number segments that he has provided too.

Considerations

  • From conversation with Geoff 20200129, we will be getting different absolute copy number files (and their corresponding exposures) because the method of purity/ploidy inference has been improved.
  • OV-US and OV-AU are both ICGC
  • version of the TCGA etc signatures: the new exposures (sent by Ruben on late November) I just have for TCGA, but this is because the ASCAT segments have been modified differently, and it shouldn’t make a difference for non-SNP array.
  • version of the organoid signatures: I am using version “organoid_exposures_Aug21.rds”

Changes in signatures extraction (from Ruben) - removing a big from one the features: the first segment was not counted, whih is not too important for OV - the pre-processing of CN segments (only applicable to SNP array)

The previous data are:

  • 132 patients (BriTROC-1) using low-cost shallow whole-genome sequencing (sWGS; 0.1×)
  • 112 dWGS HGSOC cases from the Pan-Cancer Analysis of Whole Genomes (PCAWG)
  • 415 HGSOC cases with SNP array and whole-exome sequencing data from The Cancer Genome Atlas (TCGA)

BriTROC: there are the original BriTROC segments (from manuscript) and new BriTROC segments (called BriTROC 2, here, but it’s not the BriTROC-2 cohort! made by Ruben).

Loading organoids data

org<- as(readRDS("data/organoid_exposures.rds"), 'matrix')
# rownames(org) <- paste0('Sample ', 1:nrow(org))
names_orgs = readxl::read_xlsx("data/NewOrganoidNaming.xlsx")
names_orgs$`new name`[match(rownames(org), paste0(names_orgs$`old name`, 'org'))]
##  [1] "PDO16" "PDO15" "PDO3"  "PDO9"  "PDO5"  "PDO6"  "PDO10" "PDO1"  "PDO12"
## [10] "PDO4"  "PDO2"  "PDO18" "PDO7"  "PDO17" "PDO8"  "PDO13" "PDO14" "PDO11"
rownames(org) = names_orgs$`new name`[match(rownames(org), paste0(names_orgs$`old name`, 'org'))]

Data from organoids

rename_rows = function(i, new_names){
  rownames(i) = new_names; return(i)}
## Creating plot... it might take some time if the data are large. Number of samples: 18

Apparently only PDO1 is a WGD sample. I see which log-ratio with any other signature shows a clear difference between this WGD sample and the rest. The ratio between s4 (WGD signature) and s2 is.

##   Var1 Var2      value group
## 1 PDO1    1  0.2064604   WGD
## 2 PDO1    2 13.6860836   WGD
## 3 PDO1    3  3.9450050   WGD
## 4 PDO1    4  1.0000000   WGD
## 5 PDO1    5        Inf   WGD
## 6 PDO1    6  3.3746386   WGD
## 7 PDO1    7        Inf   WGD
## Warning: Removed 11 rows containing missing values (geom_point).

apply(org, 2, function(j) table(factor(j==0, levels=c(T,F))))
##       s1 s2 s3 s4 s5 s6 s7
## TRUE   1  0  0  6  4  4  2
## FALSE 17 18 18 12 14 14 16
table(apply(org[,-5], 1, function(j) paste0(j==0,collapse='-')))
## 
## FALSE-FALSE-FALSE-FALSE-FALSE-FALSE  FALSE-FALSE-FALSE-FALSE-FALSE-TRUE 
##                                   9                                   2 
##  FALSE-FALSE-FALSE-TRUE-FALSE-FALSE   FALSE-FALSE-FALSE-TRUE-TRUE-FALSE 
##                                   3                                   3 
##   TRUE-FALSE-FALSE-FALSE-TRUE-FALSE 
##                                   1

Data from Nature Genetics 2018 paper

We are loading both the original signatures, and the updated signatures.

Number of zeros in exposures

We have two dataframes: with the previous TCGA samples and with the current ones. Both contain the BriTROC and ICGC to this as well (which are shared).

## The percentage of zeros in each cohort is:
## $organoids
## $organoids[[1]]
## [1] "13.492%"
## 
## 
## $ExposuresNatGen
## $ExposuresNatGen$britroc
## [1] "0%"
## 
## $ExposuresNatGen$`OV-AU`
## [1] "0%"
## 
## $ExposuresNatGen$`OV-US`
## [1] "0%"
## 
## $ExposuresNatGen$TCGA
## [1] "0%"
## 
## 
## $UpdatedExposures
## $UpdatedExposures$britroc
## [1] "0%"
## 
## $UpdatedExposures$`OV-AU`
## [1] "0%"
## 
## $UpdatedExposures$`OV-US`
## [1] "0%"
## 
## $UpdatedExposures$`Updated TCGA`
## [1] "24.305%"

This makes the organoids and the TCGA exposures sample, and leaves the other in the periphery of the PCA. I suspect this is due to the number of zero exposures, which are imputated using the robust analyses that I am using here:

  • The number of organoids is 18
  • The number of TCGA samples in the previous (published) cohort was 374.
  • The number of TCGA samples in the current (Ruben’s) cohort is 529.
  • The number of TCGA samples found in the previous cohort but not in the current is 0.
  • The number of TCGA samples found in the current cohort but not in the previous is 374.

We are only selecting the updated exposures, now

which_natgen = 'UpdatedSignatures'

PCA

For compositional data, in the book Analysing compositional data with R they say that PCA should be done on clr-transformed data. Zeroes are an issue if we use clr using all samples. The robust clr is implemented in the package compositions and deals with this problem by doing the geometric mean over only non-zero values, and setting the clr of a part which is zero to zero.

The plot done with (biplot(princomp(acomp(x)))) is the same as plotting princomp(as(clr(x), ‘matrix’))

Creating a PCA with the data from the clinical cohorts, and projecting the organoids

## Saving 7 x 5 in image
## Saving 7 x 5 in image

What is different in these ‘underrepresented’ clinical samples?

I.e., what type of signatures are not represented in the organoids?

Conclusion: it seems as though it’s signature 3, the relative abundance of which is never high in organoid samples.

I am comparing

  • the barplots of the exposures

  • CLR (centered log-ratio) of signature 3 is high in the underrepresented samples

  • the ratio of the sums of different signatures, e.g. the ratio of 1+3+5 vs 2+4+6+7.

  • ILR (isometric log-ratio) when splitting the dataset into s3 and all other signatures. It is the log-ratio of the exposure to signature 3 and the geometric mean of all other exposures.

## Creating plot... it might take some time if the data are large. Number of samples: 159
## Creating plot... it might take some time if the data are large. Number of samples: 50
## Creating plot... it might take some time if the data are large. Number of samples: 18

Loadings

Looking at the loadings. In particular, looking for components in the first and second PC

Respectively, using the first and the second batch of signatures.

Signatures 3 and 6 seem to be quite important for the underrepresented groups

Dendrograms

Dendrogram based on the signatures

The colour of the labels shows whether there is any zero exposure in the vector of exposures of the sample.

##  removed due to infinite values

## quartz_off_screen 
##                 2
## quartz_off_screen 
##                 2

Heatmap of the samples in the dendrogram

## quartz_off_screen 
##                 2
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

## quartz_off_screen 
##                 2

## quartz_off_screen 
##                 2
labels(dendro_UpdatedExposures)[grep('Sample', labels(dendro_UpdatedExposures))]
## character(0)

What type of signatures appear on the branch with no organoids?

We are looking at the split plot below (i.e. the first split). We call ‘underrepresented’ the samples that fall on the right branch.

Which cohorts are these samples from?

underrepresented_labels
##   [1] "IM_87"                               
##   [2] "84ca6ab0-9edc-4636-9d27-55cdba334d7d"
##   [3] "IM_23"                               
##   [4] "IM_68"                               
##   [5] "d8c2b4b2-e12b-43d2-bafc-87b29f027797"
##   [6] "IM_100"                              
##   [7] "c767254e-b289-4904-a80f-050cf01ff8ba"
##   [8] "44eebc04-c027-45ae-beca-c4012b494f29"
##   [9] "da43386c-47f8-4e03-b6ca-8b94e13792e7"
##  [10] "9a5bb831-8cb8-4de0-b94b-088cb38def1a"
##  [11] "IM_155"                              
##  [12] "bff518fb-6da7-4dfc-ae4c-bd3f641028e2"
##  [13] "TCGA-24-1557"                        
##  [14] "TCGA-13-0792"                        
##  [15] "TCGA-13-2059"                        
##  [16] "TCGA-04-1338"                        
##  [17] "TCGA-25-1322"                        
##  [18] "TCGA-23-1114"                        
##  [19] "TCGA-24-1924"                        
##  [20] "TCGA-13-0794"                        
##  [21] "TCGA-24-2030"                        
##  [22] "TCGA-13-2057"                        
##  [23] "TCGA-23-1122"                        
##  [24] "TCGA-13-1408"                        
##  [25] "TCGA-13-0885"                        
##  [26] "TCGA-13-0893"                        
##  [27] "TCGA-61-2612"                        
##  [28] "TCGA-23-2649"                        
##  [29] "TCGA-29-1688"                        
##  [30] "TCGA-42-2590"                        
##  [31] "TCGA-13-2061"                        
##  [32] "TCGA-20-1683"                        
##  [33] "TCGA-13-1505"                        
##  [34] "TCGA-24-1469"                        
##  [35] "TCGA-29-1761"                        
##  [36] "0009b464-b376-4fbc-8a56-da538269a02f"
##  [37] "IM_195"                              
##  [38] "86f23897-dba0-4e89-8381-d174eaa6fcc1"
##  [39] "IM_80"                               
##  [40] "TCGA-25-2393"                        
##  [41] "d4bf6034-aeae-48a6-907b-10e2cc904015"
##  [42] "e45f3391-2e74-4767-817a-280cebac7c57"
##  [43] "f858d813-f3c5-4ad9-8c20-9f231d6624d8"
##  [44] "e4aaca83-3ae9-47f6-a975-c144767ad705"
##  [45] "d67cad13-e849-48b0-926c-10b6046ba0b9"
##  [46] "TCGA-13-1495"                        
##  [47] "TCGA-13-0891"                        
##  [48] "TCGA-24-2026"                        
##  [49] "TCGA-24-2023"                        
##  [50] "TCGA-31-1953"                        
##  [51] "09508a0d-ebe0-4fa1-b7b2-1710814181cd"
##  [52] "TCGA-09-1673"                        
##  [53] "IM_107"                              
##  [54] "TCGA-29-1702"                        
##  [55] "TCGA-13-0724"                        
##  [56] "TCGA-24-1549"                        
##  [57] "89dad92e-5b3f-479a-a6da-a94ee7df7f8a"
##  [58] "b243adb4-b3e7-4e0e-bc0d-625aa8dbb1be"
##  [59] "TCGA-04-1348"                        
##  [60] "TCGA-20-0987"                        
##  [61] "TCGA-61-2000"                        
##  [62] "2a8d63eb-0174-4213-9214-413f391f512c"
##  [63] "8b28f6d2-4b7d-493b-826e-b119a4fb0cb4"
##  [64] "33ea81f2-db2c-4567-bd7b-4cb9aadfef88"
##  [65] "IM_71"                               
##  [66] "a3135834-3af0-4e98-bc6f-ad8ddf33db80"
##  [67] "f1504811-8363-41e6-b43c-62452b1262d3"
##  [68] "TCGA-09-2044"                        
##  [69] "TCGA-13-1481"                        
##  [70] "TCGA-30-1860"                        
##  [71] "TCGA-36-2542"                        
##  [72] "745b8756-0eab-423f-8cde-e0ff1aaa6596"
##  [73] "IM_81"                               
##  [74] "95fc38ac-2b36-4c46-abbf-8d2d52ff9626"
##  [75] "IM_6"                                
##  [76] "5b560f4c-d2a3-43fa-b394-abef78bdefc1"
##  [77] "6797443c-eb4a-4654-b957-c0056e5a4206"
##  [78] "IM_124"                              
##  [79] "dce54d09-9827-4fe2-abe1-c5b7d528ba7f"
##  [80] "129de5b2-d9b0-4762-9ef8-72d98231fb50"
##  [81] "b97941dd-9844-4db2-9e25-42c725f47d70"
##  [82] "0664753b-7566-41e0-8006-7009c6735406"
##  [83] "9aecfc8f-62ea-4acf-aa00-d1f0fe6c6556"
##  [84] "b75b2663-dcc6-411c-bfcc-574aa33cf388"
##  [85] "46f19b5c-3eba-4b23-a1ab-9748090ca4e5"
##  [86] "TCGA-24-1471"                        
##  [87] "0ae2193f-0d68-485a-b8c2-7568cbcce33e"
##  [88] "TCGA-13-0804"                        
##  [89] "IM_184"                              
##  [90] "TCGA-24-0980"                        
##  [91] "TCGA-61-1733"                        
##  [92] "TCGA-25-1329"                        
##  [93] "TCGA-61-2017"                        
##  [94] "TCGA-13-0889"                        
##  [95] "TCGA-23-1027"                        
##  [96] "TCGA-09-2055"                        
##  [97] "TCGA-29-1776"                        
##  [98] "TCGA-13-0726"                        
##  [99] "TCGA-59-2354"                        
## [100] "669f0e01-28f6-4ed8-bdb5-73f84ea28f78"
## [101] "IM_22"                               
## [102] "35ceba07-0759-4fbe-b076-af821a528cf0"
## [103] "d9e66fc5-9018-4568-b388-c5eb756f7823"
## [104] "IM_149"                              
## [105] "2f2eaecc-6509-423f-b63a-8c3bea1ba4a4"
## [106] "IM_154"                              
## [107] "efec3225-de07-4559-9a90-95223495cc61"
## [108] "42465bbd-289b-4e96-98fe-76809c5e1520"
## [109] "3c2b1509-1eb9-4b79-9569-57810f291499"
## [110] "80f02aec-c07a-4bcb-b547-e60f8c33a7b1"
## [111] "58faf969-bf37-4180-8807-2f44f2cc8eda"
## [112] "bc9b66f5-fcb8-4545-ab2d-438bb810edc0"
## [113] "6f981023-4269-4e8e-a4ab-2c92bb27273c"
## [114] "5d922e48-aa70-454d-9417-c9af686feebc"
## [115] "e9d98643-01ee-40c3-a617-e004559625cd"
## [116] "IM_67"                               
## [117] "4679f37a-4f09-449b-a1d8-1f02847996da"
## [118] "TCGA-24-1563"                        
## [119] "TCGA-13-1509"                        
## [120] "TCGA-61-1900"                        
## [121] "TCGA-61-2018"                        
## [122] "f988e698-9b34-45ce-ba4c-74e06e9cae4a"
## [123] "IM_160"                              
## [124] "dbbd54db-4470-4df6-b5be-3e175c7133b0"
## [125] "efbec43c-0c16-4006-abe8-c3ec2ec42c05"
## [126] "c2ec7f57-8510-4bbf-a2e9-dbd9ce8dcad1"
## [127] "e84debc4-b47d-48ed-a0d0-2859f0ebf987"
## [128] "c9959f68-c385-4c1f-9188-8203844d288e"
## [129] "42af8f74-fd4b-486d-bc11-db53cc471d62"
## [130] "8658f4f5-9a50-4195-8ea3-227951977647"
## [131] "2b40a733-7a63-4bb8-a953-95a4ee28f962"
## [132] "7a921087-8e62-4a93-a757-fd8cdbe1eb8f"
## [133] "f6c811ff-f22e-490b-9b23-b527d20e6e6d"
## [134] "2b4feb84-89e4-4c38-8561-5ffab02c8132"
## [135] "51b25b37-f75c-4380-a0f6-5273e0b7ee33"
## [136] "f6189828-eeaa-4d21-b163-53bf3d47a640"
## [137] "127b0f7d-d24e-48b7-ac25-d3f14a43952d"
## [138] "TCGA-57-1992"                        
## [139] "2c9dc04b-e9ec-4cf1-ab2c-a18edb30dd37"
## [140] "IM_83"                               
## [141] "8a69f3ca-4e17-4daa-8722-a36316e345ba"
## Number of organoids in underrepresented and represented split: 0 18

There are two types of population which are not represented:

  • On the right of the PCA, right on the barplot above)
    • Relative higher s3: the non-represented samples on the right have a very high exposure of s3. Organoids in general don’t have such high exposures.
    • Relative lower s4
  • On the bottom left of the PCA (left of barplot above)
    • s1: A fraction of underrepresented samples have extremely low s1
  • In general, relative higher s5 in the underrepresented samples (supported by from loadings of PC2, and from the CLR of the samples in the first split of the dendrogram). Organoids have in general a very low exposure of s5.

These are the exposures for some signatures, in the PCA projection.

To make sure this is not due to the type of signatures we are using (since the array ones have more zeros)

## Fraction of samples with any zero in underrepresented:  0.3404255 
## Fraction of samples with any zero in represented:  0.7223199
give_pca = function(data_matrix, center = T, title='', names, names_bool=T, give_loadings=F, print_labels=T, groups=NULL,groups_shape=NULL, nrow_legend=3, size_points=1, print_both_labels=FALSE){
  prcomp_res = prcomp(data_matrix, scale. = TRUE, center = center)
  eigs <- prcomp_res$sdev^2
  if(!names_bool){
    names=NA
  }
  
  if(give_loadings){
    if(!is.null(groups_shape)){
      stop('Groups shape not yet implemented for loadings')
    }
    a = ggplot()+
      geom_point(data=cbind.data.frame(prcomp_res$x[,1:2], names=names), aes(x=PC1, y=PC2), size=size_points)+
      geom_segment(data=cbind.data.frame(prcomp_res$rotation[,1:2], names=paste0('n', 1:ncol(data_matrix))),
                   aes(x=0, y=0, xend=PC1*4, yend=PC2*4), col='red', arrow = arrow(length = unit(0.03, "npc")))+
      labs(x=paste0('PC1 (', round(100*eigs[1]/sum(eigs), 2), '%)' ),
           y=paste0('PC2 (', round(100*eigs[2]/sum(eigs), 2), '%)' ))+ggtitle(title)
    if(print_labels){
      a = a+geom_label_repel(data=cbind.data.frame(prcomp_res$rotation[,1:2], names=paste0('n', 1:ncol(data_matrix))),
                              aes(x=PC1*3, y=PC2*3, label=names), size=3, col='red')
      if(print_both_labels){
        a = a+geom_label_repel(data=cbind.data.frame(prcomp_res$x[,1:2], names=rownames(prcomp_res$x)),
                              aes(x=PC1, y=PC2, label=names), size=3, col='black')
      }
    }
    a
  }else{
    if(!is.null(groups)){
      df = cbind.data.frame(prcomp_res$x[,1:2], names=names, groups=as.factor(groups))
    }else{
      df = cbind.data.frame(prcomp_res$x[,1:2], names=names)
    }
    if(!is.null(groups_shape)){
      df = cbind(df, groups_shape=groups_shape)
    }
    a = ggplot(df,
               aes(x=PC1, y=PC2,label=gsub('Sample ', 'PDO', names)))+
      labs(x=paste0('PC1 (', round(100*eigs[1]/sum(eigs), 2), '%)' ),
           y=paste0('PC2 (', round(100*eigs[2]/sum(eigs), 2), '%)' ))+ggtitle(title)
    if(!is.null(groups)){
      if(!is.null(groups_shape)){
        a = a +geom_point(aes(col=groups, shape=groups_shape), size=size_points)
      }else{
        a = a +geom_point(aes(col=groups), size=size_points)
      }
      a = a + theme(legend.position = "bottom",
            legend.key.size = unit(0.3, "cm"),
            legend.key.width = unit(0.2,"cm"),
            legend.title = element_blank())+
        guides(col=guide_legend(nrow=nrow_legend,byrow=TRUE))
    }else{
      if(!is.null(groups_shape)){
        a = a + geom_point(aes(shape=groups_shape), size=size_points)
      }else{
        a = a + geom_point(size=size_points)
      }
    }

    
    if(print_labels){
      a = a+geom_label_repel(size=3,)
    }
    a
  }
}

PCA of a uniform sample from the 3-dimensional simplex (i.e. four parts), with three different transformations

PCA of organoids with three variations

Plot the loadings of the organoids (PCA only with organoids)

grouping_cohort_all_basic = c(natgen_metadata[[which_natgen]]$study, rep('organoids', nrow(org)))
grouping_cohort_all_basic[grouping_cohort_all_basic != "organoids"] = "primary"
  • PCA of only-nonzero organoids

  • PCA of all samples, with three variations. CLR is robust.

  • PCA of all samples, with three variations, and with an addition of 1e-4 to all samples. CLR, and only CLR, changes notably.

  • PCA of all samples, with three variations, and with an inputation of 1e-4 in all samples. As expected, the results are essentially the same as adding the same value to all elements (zero or non-zero).

  • PCA of all samples, with three variations, and with an addition of 1e-2 to all samples

s1 seems not to be of importance in the loadings of this last PCA. Use it as baseline for PCA (good, too, because it’s almost always strictly positive).

## Creating plot... it might take some time if the data are large. Number of samples: 18

pairs( (org/org[,1])[,-1])

pairs( (all_natgen[[which_natgen]]/all_natgen[[which_natgen]][,1])[,-1])

pairs( (normalise_rw(all_natgen[[which_natgen]][,-5])/all_natgen[[which_natgen]][,1])[,-1])

df_correlations_exposures_organoids = data.frame(ALR_bsS1_s2=org[,2]/org[,1],
                  ALR_bsS1_s4=org[,4]/org[,1],
                  ALR_bsS1_s6=org[,6]/org[,1])
grid.arrange(ggplot(df_correlations_exposures_organoids, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s4))+
               geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s4)'),
             ggplot(df_correlations_exposures_organoids, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s6))+
               geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s6)'), nrow=1)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

org_rename = org
rownames(org_rename) = gsub("Sample ", "PDO", rownames(org))
grid.arrange(createBarplot(cbind(org_rename[,c(1,3,5,7)], sum=rowSums(org_rename[,c(2,4,6)])), remove_labels = FALSE, order_labels = names(sort(org_rename[,1]))) + 
    labs(y='Exposure')+ ggtitle('Exposures for the organoids')+theme(axis.text.x = element_text(angle = 45))+theme(legend.position = "bottom"),
createBarplot(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)]))), remove_labels = FALSE, order_labels = names(sort(org_rename[,1]))) + 
    labs(y='Exposure')+ ggtitle('Exposures for the organoids')+theme(axis.text.x = element_text(angle = 45))+theme(legend.position = "bottom"), nrow=1)

## Creating plot... it might take some time if the data are large. Number of samples: 18
## Creating plot... it might take some time if the data are large. Number of samples: 18
grid.arrange(give_pca(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)]))), names=rownames(org_rename)),
             give_pca(as(compositions::clr(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)])))), 'matrix'), names=rownames(org_rename)),
             give_pca(as(compositions::clr(normalise_rw(cbind(org_rename[,c(1,3,7)], sum=rowSums(org_rename[,c(2,4,6)])))), 'matrix'), names=rownames(org_rename),
                      give_loadings = T), nrow=1)

Note: s3 and s4 are consistently shown in the same PC axis, in opposite directions.

df_correlations_exposures_cohorts = data.frame(ALR_bsS1_s2=all_natgen[[which_natgen]][,2][1:nrow(natgen_metadata[[which_natgen]])]/all_natgen[[which_natgen]][,1][1:nrow(natgen_metadata[[which_natgen]])],
           ALR_bsS1_s4=all_natgen[[which_natgen]][,4][1:nrow(natgen_metadata[[which_natgen]])]/all_natgen[[which_natgen]][,1][1:nrow(natgen_metadata[[which_natgen]])],
          ALR_bsS1_s6=all_natgen[[which_natgen]][,6][1:nrow(natgen_metadata[[which_natgen]])]/all_natgen[[which_natgen]][,1][1:nrow(natgen_metadata[[which_natgen]])],
                  colour=natgen_metadata[[which_natgen]]$study[1:nrow(natgen_metadata[[which_natgen]])]
                  )
grid.arrange(ggplot(df_correlations_exposures_cohorts, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s4))+
               geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s4)'),
             ggplot(df_correlations_exposures_cohorts, aes(x=ALR_bsS1_s2,y=ALR_bsS1_s6))+
               geom_point()+geom_smooth(method = "lm")+labs(x='ALR_s1(s2)', y='ALR_s1(s6)'), nrow=1)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 33 rows containing non-finite values (stat_smooth).
## Warning: Removed 7 rows containing missing values (geom_point).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 33 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).

  • Inputation of 1e-2, and without s5 (first three plots), and with robust CLR wihout imputation (last plot)

Pretty interesting plot, as it seems to show four driving forces whicc are orthogonal two-on-two. n1 (s1) and s2 (s3) are anticorrelated and driving most of the variance in organoids. The sum of s2,4,6, is orthogonal to both s1 and s3, and causes most subsequent variation.

## Warning: position_stack requires non-overlapping x intervals

## Creating plot... it might take some time if the data are large. Number of samples: 18

Does the same happen in the cohorts?

give_pca(data_matrix = as(compositions::clr(normalise_rw(cbind(natgen[[which_natgen]][,c(1,3,7)],
                                                                sum=rowSums(natgen[[which_natgen]][,c(2,4,6)])))), 'matrix'),
         names=rownames(natgen[[which_natgen]]), give_loadings = T, print_labels = T, print_both_labels=F)

Some comments on the PCAs

  • Whenever an imputation for zero values is used, a large fraction of organoid samples appear as outliers, and so do the other TCGA samples (which are the other samples that contain zeros)
  • If the robust CLR PCA is used, there is no such distinction
  • If the raw values are used (which I think there should be no problem using), there is no big difference either.

Hierarchical clustering

give_dendrogram_generalised = function(df, modify_labels=T, keep_only_PDO){
  if(keep_only_PDO & modify_labels){stop('Only one can be true: keep_only_PDO, modify_labels')}
  
  x = hclust(dist(df))
  if(modify_labels){
    x$labels[!grepl("PDO", x$labels)] = ''
    x$labels[grepl("PDO", x$labels)] = '*'
  }
  
  if(keep_only_PDO){
    x$labels[!grepl("PDO", x$labels)] = ''
  }
  
  y = x
  y$labels = gsub("Sample ", "PDO", x$labels)
  # plot(x)

  labelCol <- function(x) {
    if (is.leaf(x)) {
      ## fetch label
      label <- attr(x, "label")
      ## set label color to red for A and B, to blue otherwise
      attr(x, "nodePar") <- list(lab.col="blue")
    }
    return(x)
  }
  d <- dendrapply(as.dendrogram(x), labelCol)
  plot(d)
  return(y)

}

Plot showing differences in amalgamation ratios

par(mfrow=c(1,2))
dendroraw = give_dendrogram_generalised(all_natgen[[which_natgen]], modify_labels = F, keep_only_PDO = T)
dendroclr = give_dendrogram_generalised(as(compositions::clr(all_natgen[[which_natgen]]), 'matrix'), modify_labels = F, keep_only_PDO=T)

par(mfrow=c(1,4), mar=c(0.2,0.2,0.2,0.2))
dendroraw_pdo = give_dendrogram_generalised(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], modify_labels=F, keep_only_PDO = T)
dendroclr_pdo = give_dendrogram_generalised(as(compositions::clr(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),]), 'matrix'), modify_labels=F, keep_only_PDO = T)
dendroimput_pdo = give_dendrogram_generalised(impute(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], 1e-2), modify_labels=F, keep_only_PDO = T)
dendroimputclr_pdo = give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = T)

dendroimputclr_all = give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = F)

library(dendextend) # for comparing two dendrograms
dend1comparison <- as.dendrogram (dendroraw_pdo)
dend2comparison <- as.dendrogram (dendroclr_pdo)
dend3comparison <- as.dendrogram (dendroimput_pdo)
dend4comparison <- as.dendrogram (dendroimputclr_pdo)

tanglegram(dend1comparison, dend2comparison)
tanglegram(dend1comparison, dend3comparison)
tanglegram(dend1comparison, dend4comparison)
tanglegram(dend2comparison, dend4comparison)
give_dendrogram_generalised(as(compositions::clr(impute(all_natgen[[which_natgen]][grepl('PDO', rownames(all_natgen[[which_natgen]])),], 1e-2)), 'matrix'), modify_labels=F, keep_only_PDO = T)

## 
## Call:
## hclust(d = dist(df))
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 18
dend_data_inputclr <- dendro_data(dendroimputclr_all, type = "rectangle")
dend_data_inputclr$labels$label = as.character(dend_data_inputclr$labels$label)
dend_data_inputclr$labels$label[!grepl('PDO', dend_data_inputclr$labels$label)] = ""
p_v2 <- ggplot(dend_data_inputclr$segments) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
  # geom_text(data = dend_data$labels, aes(x, y, label = label),
  #           hjust = 1, angle = 90, size = 3)+
  geom_label_repel(data = dend_data_inputclr$labels, aes(x, y, label = gsub('Organoid ', '', label)),
            hjust = 0, size = 3, vjust=0, nudge_y = -2)+
  ylim(-3, 15)+
  theme_bw()+
  theme(axis.title.x=element_blank(),
    axis.text.x=element_blank(),
    axis.ticks.x=element_blank(),
    axis.title.y=element_blank(),
    axis.text.y=element_blank(),
    axis.ticks.y=element_blank(),
    panel.grid = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank())+
    scale_x_continuous(expand = c(extra_expand, extra_expand))+
  scale_y_continuous(expand = c(0.5, 0, 0.05, 0))
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
# print(p)

heatmap_dendrogram_df_inputclr = t(all_natgen$UpdatedExposures[rownames(all_natgen$UpdatedExposures)[match(gsub("Organoid ", "", labels(dendroimputclr_all)),rownames(all_natgen$UpdatedExposures))],])
# heatmap_dendrogram_df_inputclr = colnames(heatmap_dendrogram_df)
# heatmap_dendrogram_df_inputclr[!grepl('Sample',heatmap_dendrogram_df_inputclr)] = ""
# heatmap_dendrogram_df_inputclr[grepl('Sample',heatmap_dendrogram_df_inputclr)] = "*"

ggthemr::ggthemr("pale")
extra_expand_v2 = .040
p2_inputclr = ggplot(melt(heatmap_dendrogram_df_inputclr), aes(x=Var2, y=value, fill=Var1))+geom_bar(stat='identity')+theme_bw()+
    theme(axis.title.x=element_blank(),  legend.title=element_blank(),
          legend.text=element_text(size=10),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),  legend.position = "bottom",
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.border = element_blank())+
  scale_fill_brewer(palette="Dark2")+
      scale_x_discrete(expand = c(extra_expand_v2, extra_expand_v2))+
  guides(fill = guide_legend(nrow = 1))
# 

pdf("figures/barplot_dendrogam_impute10emin2_clr.pdf", width=10, height = 4)
grid.arrange(p_v2, p2_inputclr, heights=c(2,1))
dev.off()
## quartz_off_screen 
##                 2
grid.arrange(p_v2, p2_inputclr, heights=c(2,1))

# grid.arrange(p, p3, heights=c(2,1))
# grid.arrange(p, p3, heights=c(2,1))

Analysis of CN profiles

additional genomic data comparing the tumours to the organoids in terms of ploidy, number of rearrangements and any other things that you think could be relevant

Load the segments

Distribution of features

pcawg_CN_features = readRDS("data/pcawg_CN_features.rds")
tcga_CN_features = readRDS("data/tcga_CN_features.rds")

BriTROC_absolute_copynumber = readRDS("data/BriTROC_absolute_copynumber.rds")
BriTROC2_CN_features = readRDS("data/6_TCGA_Signatures_on_BRITROC/0_BRITROC_absolute_CN.rds")

organoids_absolute_copynumber = readRDS("data/organoid_absolute_CN.rds")
sampleNames(organoids_absolute_copynumber) = names_orgs$`new name`[match(gsub("org", "", sampleNames(organoids_absolute_copynumber)), names_orgs$`old name`)]
organoids_CN_features = extractCopynumberFeatures(organoids_absolute_copynumber)

BriTROC_CN_features = readRDS("data/BriTROC_CN_features.rds")

The number of segments can be taken either from segsize (first column) or from copynumber (last column). This is just for PCAWG and TCGA! Not for BriTROC. Any idea why this is the case?

** Note: I am plotting this as the log!**

** Note 2: I am pooling together segments from several samples and analysing them all together! I.e. I have not averaged over samples **

## quartz_off_screen 
##                 2

Tests for equality of means of the distribution of features (average of segments within a sample)

  • I have computed the average value for each feature for each sample. Then I have compared the distribution of means of the patient cohorts all together, versus the distribution of means of the organoids
  • This is different from the approach above, where we were pooling segments for all samples (i.e. not keeping patient information)
  • In all cases I have log-transformed the data, which ensues normality, to compute a p-value for the difference in means
  • I am using the default r t-test, which is the Welch t-test, which does not assume equality of variances
  • In all cases there is no statistical significance between organoids the cohorts, indicating a similar distribution (at least when it comes to their means)
  • The only feature which is truly problematic is copy number, which very evidently shows a bimodal distribution (since there are the normal segments, or segments of copy number 1, and the amplified segments, which appear as two populations)
distrib_segsize = create_distrib_df('segsize', 'value')
distrib_bp10MB = create_distrib_df('bp10MB', 'value')
distrib_osCN = create_distrib_df('osCN', 'value')
distrib_bpchrarm = create_distrib_df('bpchrarm', 'ct1')
distrib_changepoint = create_distrib_df('changepoint', 'value')
distrib_copynumber = create_distrib_df('copynumber', 'value')

#------------------------------------------------------------------------------------------------#

## 1/6 breakpoints per 10MB
grid.arrange(give_joint_histogram(list(distrib_bp10MB[['org']]$`mean(ct1_num)`,
                          distrib_bp10MB[['pcawg']]$`mean(ct1_num)`,
                          distrib_bp10MB[['tcga']]$`mean(ct1_num)`,
                          distrib_bp10MB[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE),
give_joint_histogram(list(distrib_bp10MB[['org']]$`mean(ct1_num)` %>% log,
                          distrib_bp10MB[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_bp10MB[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_bp10MB[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)

t.test(distrib_bp10MB[['org']]$`mean(ct1_num)` %>% log,
       c(distrib_bp10MB[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_bp10MB[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_bp10MB[['BriTROC']]$`mean(ct1_num)` %>% log))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_bp10MB[["org"]]$`mean(ct1_num)` %>% log and c(distrib_bp10MB[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_bp10MB[["tcga"]]$`mean(ct1_num)` %>% distrib_bp10MB[["org"]]$`mean(ct1_num)` %>% log and     log, distrib_bp10MB[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = -0.52491, df = 18.513, p-value = 0.6059
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3415468  0.2047773
## sample estimates:
##  mean of x  mean of y 
## -0.8415706 -0.7731859
#------------------------------------------------------------------------------------------------#
## 2/6 segment size
grid.arrange(give_joint_histogram(list(distrib_segsize[['org']]$`mean(ct1_num)`,
                          distrib_segsize[['pcawg']]$`mean(ct1_num)`,
                          distrib_segsize[['tcga']]$`mean(ct1_num)`,
                          distrib_segsize[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_segsize[['org']]$`mean(ct1_num)` %>% log,
                          distrib_segsize[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_segsize[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_segsize[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)

t.test(distrib_segsize[['org']]$`mean(ct1_num)` %>% log,
       c(distrib_segsize[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_segsize[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_segsize[['BriTROC']]$`mean(ct1_num)` %>% log))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_segsize[["org"]]$`mean(ct1_num)` %>% log and c(distrib_segsize[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_segsize[["tcga"]]$`mean(ct1_num)` %>% distrib_segsize[["org"]]$`mean(ct1_num)` %>% log and     log, distrib_segsize[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = 0.77176, df = 18.338, p-value = 0.4501
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1435866  0.3106786
## sample estimates:
## mean of x mean of y 
##  16.69896  16.61541
#------------------------------------------------------------------------------------------------#
## 3/6 oscillating CN
grid.arrange(give_joint_histogram(list(distrib_osCN[['org']]$`mean(ct1_num)`,
                          distrib_osCN[['pcawg']]$`mean(ct1_num)`,
                          distrib_osCN[['tcga']]$`mean(ct1_num)`,
                          distrib_osCN[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_osCN[['org']]$`mean(ct1_num)` %>% log,
                          distrib_osCN[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_osCN[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_osCN[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)
## Warning: Removed 15 rows containing non-finite values (stat_bin).

t.test(distrib_osCN[['org']]$`mean(ct1_num)` %>% log,
       remove_infty(c(distrib_osCN[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_osCN[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_osCN[['BriTROC']]$`mean(ct1_num)` %>% log)))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_osCN[["org"]]$`mean(ct1_num)` %>% log and remove_infty(c(distrib_osCN[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_osCN[["org"]]$`mean(ct1_num)` %>% log and     distrib_osCN[["tcga"]]$`mean(ct1_num)` %>% log, distrib_osCN[["BriTROC"]]$`mean(ct1_num)` %>% distrib_osCN[["org"]]$`mean(ct1_num)` %>% log and         log))
## t = 0.36511, df = 18.948, p-value = 0.7191
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1355887  0.1928764
## sample estimates:
##  mean of x  mean of y 
## -0.5555398 -0.5841836
#------------------------------------------------------------------------------------------------#
## 4/6 num of breakpoints per chromosome arm
grid.arrange(give_joint_histogram(list(distrib_bpchrarm[['org']]$`mean(ct1_num)`,
                          distrib_bpchrarm[['pcawg']]$`mean(ct1_num)`,
                          distrib_bpchrarm[['tcga']]$`mean(ct1_num)`,
                          distrib_bpchrarm[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_bpchrarm[['org']]$`mean(ct1_num)` %>% log,
                          distrib_bpchrarm[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_bpchrarm[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_bpchrarm[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)

## Same distribution of num of breakpoints per chromosome arm
t.test(distrib_bpchrarm[['org']]$`mean(ct1_num)` %>% log,
       c(distrib_bpchrarm[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_bpchrarm[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_bpchrarm[['BriTROC']]$`mean(ct1_num)` %>% log))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_bpchrarm[["org"]]$`mean(ct1_num)` %>% log and c(distrib_bpchrarm[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_bpchrarm[["tcga"]]$`mean(ct1_num)` %>% distrib_bpchrarm[["org"]]$`mean(ct1_num)` %>% log and     log, distrib_bpchrarm[["BriTROC"]]$`mean(ct1_num)` %>% log)
## t = -0.53247, df = 18.515, p-value = 0.6007
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3425379  0.2037960
## sample estimates:
## mean of x mean of y 
##  1.078022  1.147393
#------------------------------------------------------------------------------------------------#
## 5/6 num of changepoints
grid.arrange(give_joint_histogram(list(distrib_changepoint[['org']]$`mean(ct1_num)`,
                          distrib_changepoint[['pcawg']]$`mean(ct1_num)`,
                          distrib_changepoint[['tcga']]$`mean(ct1_num)`,
                          distrib_changepoint[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_changepoint[['org']]$`mean(ct1_num)` %>% log,
                          distrib_changepoint[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_changepoint[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_changepoint[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)

## Same distribution of num of changepoints
t.test(distrib_changepoint[['org']]$`mean(ct1_num)` %>% log,
       c(distrib_changepoint[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_changepoint[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_changepoint[['BriTROC']]$`mean(ct1_num)` %>% log))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_changepoint[["org"]]$`mean(ct1_num)` %>% log and c(distrib_changepoint[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_changepoint[["tcga"]]$`mean(ct1_num)` %>% distrib_changepoint[["org"]]$`mean(ct1_num)` %>% log and     log, distrib_changepoint[["BriTROC"]]$`mean(ct1_num)` %>% distrib_changepoint[["org"]]$`mean(ct1_num)` %>% log and     log)
## t = 0.19539, df = 18.554, p-value = 0.8472
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1691729  0.2039479
## sample estimates:
## mean of x mean of y 
## 0.4200699 0.4026824
#------------------------------------------------------------------------------------------------#
## 6/6 copy number
grid.arrange(give_joint_histogram(list(distrib_copynumber[['org']]$`mean(ct1_num)`,
                          distrib_copynumber[['pcawg']]$`mean(ct1_num)`,
                          distrib_copynumber[['tcga']]$`mean(ct1_num)`,
                          distrib_copynumber[['BriTROC']]$`mean(ct1_num)`), no_colour=FALSE)+ggtitle('Raw'),
give_joint_histogram(list(distrib_copynumber[['org']]$`mean(ct1_num)` %>% log,
                          distrib_copynumber[['pcawg']]$`mean(ct1_num)` %>% log,
                          distrib_copynumber[['tcga']]$`mean(ct1_num)` %>% log,
                          distrib_copynumber[['BriTROC']]$`mean(ct1_num)` %>% log), no_colour=FALSE)+ggtitle('Log transform'), ncol=2)

## Same distribution of copy number of segments
t.test(distrib_copynumber[['org']]$`mean(ct1_num)` %>% log,
       c(distrib_copynumber[['pcawg']]$`mean(ct1_num)` %>% log,
         distrib_copynumber[['tcga']]$`mean(ct1_num)` %>% log,
         distrib_copynumber[['BriTROC']]$`mean(ct1_num)` %>% log))
## 
##  Welch Two Sample t-test
## 
## data:  distrib_copynumber[["org"]]$`mean(ct1_num)` %>% log and c(distrib_copynumber[["pcawg"]]$`mean(ct1_num)` %>% log, distrib_copynumber[["tcga"]]$`mean(ct1_num)` %>% distrib_copynumber[["org"]]$`mean(ct1_num)` %>% log and     log, distrib_copynumber[["BriTROC"]]$`mean(ct1_num)` %>% distrib_copynumber[["org"]]$`mean(ct1_num)` %>% log and     log)
## t = 0.35387, df = 18.18, p-value = 0.7275
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1286706  0.1808402
## sample estimates:
## mean of x mean of y 
##  1.160435  1.134350

Explanation: The distribution of sample-averaged values of features are the sample between organoids and the cohorts of primary tissue samples for the number of breakpoints per 10MB (Welch Two Sample t-test on log-transformed data, p-value = 0.6059), segment size (Welch Two Sample t-test on log-transformed data, p-value = 0.4501 ), oscillating copy number (Welch Two Sample t-test on log-transformed data, p-value = 0.7191), number of breakpoints per chromosome arm (Welch Two Sample t-test on log-transformed data, p-value = 0.6007), number of changepoints (Welch Two Sample t-test on log-transformed data, p-value = 0.8472), and the copy number of the segments (Welch Two Sample t-test on log-transformed data, p-value = 0.7275).

Number of segments; Poisson and NegBin GLM

TL;DR with a negative binomial model, which is much more appropriate in this setting than a Poisson, there is no difference in the distributions of organoids and non-organodis when it comes to number of segments.

## Analysis of Deviance Table
## 
## Model 1: length ~ 1
## Model 2: length ~ bool
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       824      58116                          
## 2       823      58024  1   91.318 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in anova.negbin(reduced_nb, full_nb, test = "LRT"): only Chi-squared LR
## tests are implemented
## Likelihood ratio tests of Negative Binomial Models
## 
## Response: length
##   Model    theta Resid. df    2 x log-lik.   Test    df LR stat.   Pr(Chi)
## 1     1 3.076226       824       -9971.807                                
## 2  bool 3.081258       823       -9970.344 1 vs 2     1  1.46259 0.2265185
sapply(list(glm_poisson_length[glm_poisson_length$names == 'organoids','length']), function(i) c(mean(i), sd(i)))
##          [,1]
## [1,] 169.0556
## [2,]  77.4866
sapply(list(glm_poisson_length[glm_poisson_length$names != 'organoids','length']), function(i) c(mean(i), sd(i)))
##          [,1]
## [1,] 200.4002
## [2,] 134.0146

Unfortunately the scaling factor has to do with the width of the bins in the histogram.

Ploidy

To get the ploidy, I just have to compute the weighted average of the copy number segments (this is computed from the absolute copy number profiles objects, since they specify, for each segment, its length and its ploidy).

Use getSegTable to get the segments from this Biobase file

Needs to be ammended: it has to also count areas of the genome for which we don’t have segments! i.e. we need to know the effective genome size

## we only want the ovarian ones
ICGC_absolute_copynumber_AU = readRDS("data/CN_Calls_ABSOLUTE_PCAWG/OV-AU.segments.raw.rds")
ICGC_absolute_copynumber_US = readRDS("data/CN_Calls_ABSOLUTE_PCAWG/OV-US.segments.raw.rds")
ICGC_absolute_copynumber_AU = ICGC_absolute_copynumber_AU[,c('sample', 'chr', 'startpos', 'endpos', 'segVal')]
ICGC_absolute_copynumber_US = ICGC_absolute_copynumber_US[,c('sample', 'chr', 'startpos', 'endpos', 'segVal')]

segtables_ICGC_absolute_copynumber_AU = lapply(sort(unique(ICGC_absolute_copynumber_AU$sample)),
                                            function(samplename)
                                              ICGC_absolute_copynumber_AU[ICGC_absolute_copynumber_AU$sample == samplename,])
segtables_ICGC_absolute_copynumber_AU = lapply(segtables_ICGC_absolute_copynumber_AU, function(i) { colnames(i)[colnames(i) == "chr"] = "chromosome";
colnames(i)[colnames(i) == "endpos"] = "end";
return(i) } )
names(segtables_ICGC_absolute_copynumber_AU) = unique(ICGC_absolute_copynumber_AU$sample)

segtables_ICGC_absolute_copynumber_US = lapply(sort(unique(ICGC_absolute_copynumber_US$sample)),
                                            function(samplename) ICGC_absolute_copynumber_US[ICGC_absolute_copynumber_US$sample == samplename,])
segtables_ICGC_absolute_copynumber_US = lapply(segtables_ICGC_absolute_copynumber_US, function(i) { colnames(i)[colnames(i) == "chr"] = "chromosome";
colnames(i)[colnames(i) == "endpos"] = "end";
return(i) } )
names(segtables_ICGC_absolute_copynumber_US) = unique(ICGC_absolute_copynumber_US$sample)

## for ICGC, remove the samples row and put it in the rows
segtables_ICGC_absolute_copynumber_US = lapply(segtables_ICGC_absolute_copynumber_US, function(i){
  rownames(i) = i$samples
  i = i[,-1]
  i})
segtables_ICGC_absolute_copynumber_AU = lapply(segtables_ICGC_absolute_copynumber_AU, function(i){
  rownames(i) = i$samples
  i = i[,-1]
  i})
## Check that there are no sex chromosomes included anywhere
## [1] TRUE TRUE TRUE TRUE TRUE

In the previous computation of the sample ploidy (ploidy_ICGC_US_previous, etc.) I had not included normal segments. Now I do but actually the result is very similar. (Note slghtly under- and over-estimated values at the right and left of ploidy=2).

plot(log(c(ploidy_ICGC_US_previous, ploidy_ICGC_AU_previous, ploidy_TCGA_previous,
           ploidy_BriTROC_previous, ploidy_organoids_previous)),
     log(c(ploidy_ICGC_US, ploidy_ICGC_AU, ploidy_TCGA, ploidy_BriTROC, ploidy_organoids)),
     ylab='log ploidy of samples with current mean ploidy calculation',
     xlab='log ploidy of samples with previous mean ploidy calculation')
abline(coef=c(0,1), lty='dashed')
abline(v=log(2), lty='dashed')

Ploidy is not normally distributed and it’s right-skewed. Moreover, the distribution is bimodal: I guess there are genomes in which there is a clear amplification and genomes which are more or less normal, so centered around 2.

I am also using a robust linear regression, but I don’t think this is suitable either.

t.test(log(ploidy_organoids), log(ploidy_BriTROC))
## 
##  Welch Two Sample t-test
## 
## data:  log(ploidy_organoids) and log(ploidy_BriTROC)
## t = -0.90962, df = 20.948, p-value = 0.3734
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.16268846  0.06368736
## sample estimates:
## mean of x mean of y 
## 0.9401143 0.9896149
MASS::rlm(ploidy~group,
         data=cbind.data.frame(ploidy=c(ploidy_organoids, ploidy_BriTROC), group=c(rep(1,length(ploidy_organoids)), rep(2, length(ploidy_BriTROC)))))
## Call:
## rlm(formula = ploidy ~ group, data = cbind.data.frame(ploidy = c(ploidy_organoids, 
##     ploidy_BriTROC), group = c(rep(1, length(ploidy_organoids)), 
##     rep(2, length(ploidy_BriTROC)))))
## Converged in 5 iterations
## 
## Coefficients:
## (Intercept)       group 
##   2.4791559   0.1280964 
## 
## Degrees of freedom: 298 total; 296 residual
## Scale estimate: 0.675

Segments across the genome

## Segments across the genome
# (sapply(chrlen$V1, function(i) gsub("chr", "", i)))
sorted_chroms = chrlen$V1[order(as.numeric((gsub("chr", "", chrlen$V1))))]
## Warning in eval(quote(list(...)), env): NAs introduced by coercion
chrom_lenths = chrlen[match(sorted_chroms, chrlen$V1),]

Ploidy

Ranking for the number of copy number events and ploidy

## Warning: Ignoring unknown aesthetics: label, width
## Warning: position_stack requires non-overlapping x intervals
## Warning: Removed 404 rows containing missing values (geom_label_repel).
## Warning: Removed 403 rows containing missing values (geom_label_repel).

## Warning: Ignoring unknown aesthetics: label, width
## Warning: position_stack requires non-overlapping x intervals
## Warning: Removed 461 rows containing missing values (geom_label_repel).
## Warning: Removed 460 rows containing missing values (geom_label_repel).

## Warning: Ignoring unknown aesthetics: label, width
## Warning: position_stack requires non-overlapping x intervals
## Warning: Removed 404 rows containing missing values (geom_label_repel).
## Warning: Removed 403 rows containing missing values (geom_label_repel).
## quartz_off_screen 
##                 2
## Warning: Ignoring unknown aesthetics: label, width
## Warning: position_stack requires non-overlapping x intervals
## Warning: Removed 461 rows containing missing values (geom_label_repel).
## Warning: Removed 460 rows containing missing values (geom_label_repel).
## quartz_off_screen 
##                 2
saveRDS("~/Desktop/cb.RDS", object = cbind.data.frame(value=as.numeric(ploidies$ploidy), L1=gsub("ploidy_", "", as.character(ploidies$cohort)),
                           Var1=rownames(ploidies)))
head(melt(number_of_segments))
##    Var1 value        L1
## 1  PDO1    64 organoids
## 2 PDO10   178 organoids
## 3 PDO11    64 organoids
## 4 PDO12   130 organoids
## 5 PDO13   149 organoids
## 6 PDO14   288 organoids
head(cbind.data.frame(value=as.numeric(ploidies$ploidy), L1=gsub("ploidy_", "", as.character(ploidies$cohort)),
                           Var1=rownames(ploidies)))
##      value    L1                                 Var1
## 1 2.929801 pcawg 052665d1-ab75-4f40-be5a-b88154c8beed
## 2 2.016570 pcawg 0664753b-7566-41e0-8006-7009c6735406
## 3 2.052363 pcawg 0ae2193f-0d68-485a-b8c2-7568cbcce33e
## 4 1.762452 pcawg 127b0f7d-d24e-48b7-ac25-d3f14a43952d
## 5 2.960741 pcawg 1292e13b-d7c6-447b-a227-9a8113215580
## 6 1.621123 pcawg 129de5b2-d9b0-4762-9ef8-72d98231fb50

head(organoids_absolute_copynumber)
## QDNAseqCopyNumbers (storageMode: lockedEnvironment)
## assayData: 1 features, 18 samples 
##   element names: copynumber, segmented 
## protocolData: none
## phenoData
##   sampleNames: PDO16 PDO15 ... PDO11 (18 total)
##   varLabels: name total.reads ... TP53freq (15 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1:1-30000
##   fvarLabels: chromosome start ... use (9 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

Comparing amplified/deleted genes between organoids and TCGA

data("geneInfo")
gr_BriTROC <- lapply(segtables_BriTROC_absolute_copynumber, function(i) as(data.frame(i), "GRanges"))
gr_genes_of_interest = as(data.frame(geneInfo[!(geneInfo$chrom %in% c('X', 'Y')), c('chrom', 'start', 'end', 'genename', 'geneid' )]), "GRanges")

# gr_genes_of_interest = as(data.frame(rbind(c(1, 830001, 860001),
#                           geneInfo[sample(1:nrow(geneInfo), 2000),c('chrom', 'start', 'end' )])), "GRanges")

give_ploidy_gene <- function(gene_idx, sample){
  disjoint_segments <- GenomicRanges::disjoin(c(gr_genes_of_interest[gene_idx,], sample), with.revmap=TRUE, ignore.strand=TRUE)
  # GenomicRanges::disjoin(c(gr_genes_of_interest[1,], gr_BriTROC$IM_100[1,]), with.revmap=TRUE, ignore.strand=TRUE)
  # GenomicRanges::disjoin(c(gr_BriTROC$IM_100[1,], gr_genes_of_interest[1,]), with.revmap=FALSE, ignore.strand=TRUE)
  # GenomicRanges::disjoin(c(gr_genes_of_interest[1,], gr_BriTROC$IM_100[1,]), with.revmap=FALSE, ignore.strand=TRUE)
  
  # disjoint_segments = GenomicRanges::reduce(c(gr_genes_of_interest[1,], gr_BriTROC$IM_100[1,]), with.revmap=TRUE, ignore.strand=TRUE)
  # disjoint_segments_b = GenomicRanges::reduce(c(gr_BriTROC$IM_100[1,], gr_genes_of_interest[1,]), with.revmap=TRUE, ignore.strand=TRUE)
  revmap <- mcols(disjoint_segments)$revmap
  r_scores <- extractList(mcols(c(gr_genes_of_interest[gene_idx,], sample))$segVal, revmap)

  return(weighted_ploidy(diploidifNA(sapply(1:length(r_scores), function(idx) if(1 %in% revmap[[idx]]){r_scores[[idx]][2]}else{r_scores[[idx]][1]})[sapply(revmap, function(i) any(1 %in% i))]),
                  width(disjoint_segments)[sapply(revmap, function(i) 1 %in% i)]))
}

all_ranges_cohort = rep(1:5, c(length(gr_BriTROC), length(segtables_ICGC_absolute_copynumber_AU),
                              length(segtables_ICGC_absolute_copynumber_US),
                              length(segtables_TCGA_absolute_copynumber),
                              length(segtables_organoids_absolute_copynumber)))
all_ranges_cohort = c('BriTROC', 'ICGC AU', 'ICGC US', 'TCGA', 'Organoids')[all_ranges_cohort]
all_ranges = c(gr_BriTROC,
               lapply(segtables_ICGC_absolute_copynumber_AU, function(i) as(data.frame(rename_cols(i)), "GRanges")),
               lapply(segtables_ICGC_absolute_copynumber_US, function(i) as(data.frame(rename_cols(i)), "GRanges")),
               lapply(segtables_TCGA_absolute_copynumber, function(i) as(data.frame(rename_cols(i)), "GRanges")),
               lapply(segtables_organoids_absolute_copynumber, function(i) as(data.frame(rename_cols(i)), "GRanges")))
subsample_samples = sample(1:length(all_ranges), 10)
subset_genes = c('MYC', 'CCNE1', 'PIK3CA', 'TERT', 'KRAS', 'PTEN', 'RB1', 'AKT1',
                 'AKT2', 'PARP1', 'PARP2', 'ATM', 'ATR', 'WEE1', 'TOP1', 'TUBB1')
# subsample_samples = sort(subsample_samples)
subsample_samples = 1:length(all_ranges)
# all_ploidies = outer(X = 1:length(gr_genes_of_interest),
# all_ploidies = outer(X =  which(geneInfo$genename %in% subset_genes),
#                      Y = subsample_samples, Vectorize(function(x, y) give_ploidy_gene(x,all_ranges[[y]])))
# all_ploidies = t(all_ploidies)
# rownames(all_ploidies) = names(all_ranges)[subsample_samples]
# ann_row = data.frame(cohort=all_ranges_cohort[subsample_samples], row.names = names(all_ranges)[subsample_samples])
# pheatmap(all_ploidies, annotation_row = ann_row, cluster_rows = FALSE)

## [1] 5
## [1] 76

pheatmap((all_ploidies_2_df[ann_row == 'Organoids',]), cluster_rows = FALSE, cluster_cols = FALSE,
         show_rownames = TRUE, scale = "none", color=myColor, breaks=myBreaks)

Comparisong RNASeq and copy number

all_ploidies_2_df_t = t(all_ploidies_2_df)
TPM = read.csv("../DE_resistant_sensitive/files/20191218_ViasM_BJ_orgaBrs_tpm.csv", stringsAsFactors = FALSE)
renaming1 = read_excel("../DE_resistant_sensitive/files/PDOnameProperSample_sWGS_RNAseq.xlsx")

matching_tpm = TPM[match(colnames(all_ploidies_2_df), TPM$gene_name),]
renaming1 = renaming1[match(colnames(matching_tpm), gsub('-', '', renaming1$sampleNameRNAseq)),]
colnames(matching_tpm)[!is.na(renaming1$PDO)] = renaming1$PDO[!is.na(renaming1$PDO)]

length(colnames(matching_tpm))
## [1] 33
length( colnames(all_ploidies_2_df_t))
## [1] 939
matching_samples = colnames(matching_tpm)[match(colnames(all_ploidies_2_df_t), colnames(matching_tpm))]
matching_samples = matching_samples[!is.na(matching_samples)]
matching_tpm = matching_tpm[,matching_samples]
all_ploidies_2_df_t = all_ploidies_2_df_t[,matching_samples]

dim(all_ploidies_2_df_t)
## [1] 16 17
dim(matching_tpm)
## [1] 16 17
df_correlation_tpm_ploidy = cbind.data.frame(ploidies=as.vector(all_ploidies_2_df_t), tpm=unlist(matching_tpm),
                 gene_name=rep(rownames(all_ploidies_2_df_t), ncol(all_ploidies_2_df_t)),
                 sample_name = rep(colnames(all_ploidies_2_df_t), each=nrow(all_ploidies_2_df_t)))

# ggplot(df_correlation_tpm_ploidy,
#        aes(x=ploidies, y=tpm, col=gene_name))+geom_point()+facet_wrap(.~sample_name)+scale_x_continuous(trans = "log2")

ggplot(df_correlation_tpm_ploidy,
       aes(x=ploidies, y=tpm, col=sample_name))+geom_point()+facet_wrap(.~gene_name, scales = "free")+scale_x_continuous(trans = "log2")+
    labs(x='Absolute CN', y='TPM')
## Warning: Removed 2 rows containing missing values (geom_point).

# JBLAB19936 == PDO16
renaming1[which(renaming1$sampleNameRNAseq == "JBLAB-19936"),]
## # A tibble: 1 x 4
##   ID        sampleName_sWGS PDO   sampleNameRNAseq
##   <chr>     <chr>           <chr> <chr>           
## 1 119025org JBLAB-19842     PDO16 JBLAB-19936
df_correlation_tpm_ploidy[1,]
##   ploidies  tpm gene_name sample_name
## 1 4.002733 1484       MYC       PDO16
all_ploidies_2_df['PDO16', 'MYC']
## [1] 4.002733
sort(TPM[which(TPM$gene_name == 'MYC'),])
##      seq_strand gene_seq_start gene_seq_end JBLAB19953 JBLAB19936 JBLAB19907
## 7363          1      127735434    127742950       1434       1484       1820
##      JBLAB19905 JBLAB19906 JBLAB19938 JBLAB19902 JBLAB19937 JBLAB19952
## 7363      18774        199       2017       2045       2112         24
##      JBLAB19954 JBLAB19903 JBLAB19904 JBLAB19940 JBLAB19942 JBLAB19941
## 7363         25       2670       2957       2958       3293       3926
##      JBLAB19951 JBLAB19917 JBLAB19955 JBLAB19925 JBLAB19921 JBLAB19950 seq_name
## 7363       4380        449       6070        619       7161       7596        8
##      JBLAB19920 JBLAB19939 JBLAB19916         gene_id gene_name symbol
## 7363         83        890        970 ENSG00000136997       MYC    MYC
##        gene_biotype
## 7363 protein_coding
## [1] "cohort"         "ploidy"         "organoids_bool"
## NULL
##  [1] "052665d1-ab75-4f40-be5a-b88154c8beed"
##  [2] "0664753b-7566-41e0-8006-7009c6735406"
##  [3] "0ae2193f-0d68-485a-b8c2-7568cbcce33e"
##  [4] "127b0f7d-d24e-48b7-ac25-d3f14a43952d"
##  [5] "1292e13b-d7c6-447b-a227-9a8113215580"
##  [6] "129de5b2-d9b0-4762-9ef8-72d98231fb50"
##  [7] "14ed7388-41ed-43d4-afb2-04cd6410d5d2"
##  [8] "16df7888-2480-4394-8856-d57a6ef371d2"
##  [9] "17ed8831-a261-42d9-8ff3-cf75a6cb2a24"
## [10] "1be8fa2c-8fea-4e8c-90db-c04d9fcdbf49"
##  [1] "052665d1-ab75-4f40-be5a-b88154c8beed"
##  [2] "0664753b-7566-41e0-8006-7009c6735406"
##  [3] "0ae2193f-0d68-485a-b8c2-7568cbcce33e"
##  [4] "127b0f7d-d24e-48b7-ac25-d3f14a43952d"
##  [5] "1292e13b-d7c6-447b-a227-9a8113215580"
##  [6] "129de5b2-d9b0-4762-9ef8-72d98231fb50"
##  [7] "14ed7388-41ed-43d4-afb2-04cd6410d5d2"
##  [8] "16df7888-2480-4394-8856-d57a6ef371d2"
##  [9] "17ed8831-a261-42d9-8ff3-cf75a6cb2a24"
## [10] "1be8fa2c-8fea-4e8c-90db-c04d9fcdbf49"
## [1] 3

##       MYC CCNE1 PIK3CA TERT KRAS PTEN RB1 AKT1 AKT2 PARP1 PARP2 ATM ATR WEE1
## PDO16  NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO15  NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO3    1    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO9    1    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO5   NA     1     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO6   NA     1     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO10  NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO1    1     1     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO12  NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO4    1    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO2   NA     1      1   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO18  NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO7   NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO17  NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO8   NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO13  NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO14  NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
## PDO11  NA    NA     NA   NA   NA   NA  NA   NA   NA    NA    NA  NA  NA   NA
##       TOP1 TUBB1
## PDO16   NA    NA
## PDO15   NA    NA
## PDO3    NA    NA
## PDO9    NA    NA
## PDO5    NA    NA
## PDO6    NA    NA
## PDO10   NA    NA
## PDO1    NA    NA
## PDO12   NA    NA
## PDO4    NA    NA
## PDO2    NA    NA
## PDO18   NA    NA
## PDO7    NA    NA
## PDO17   NA    NA
## PDO8    NA    NA
## PDO13   NA    NA
## PDO14   NA    NA
## PDO11   NA    NA

Amplifications/deletions of selected genes in organoids

##       MYC CCNE1 PIK3CA TERT KRAS PTEN RB1 AKT1 AKT2 PARP1 PARP2 ATM ATR WEE1
## PDO16   0     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO15   0     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO3    1     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO9    1     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO5    0     1      0    0    0    0   0    0    0     0     0   0   0    0
## PDO6    0     1      0    0    0    0   0    0    0     0     0   0   0    0
## PDO10   0     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO1    1     1      0    0    0    0   0    0    0     0     0   0   0    0
## PDO12   0     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO4    1     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO2    0     1      1    0    0    0   0    0    0     0     0   0   0    0
## PDO18   0     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO7    0     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO17   0     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO8    0     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO13   0     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO14   0     0      0    0    0    0   0    0    0     0     0   0   0    0
## PDO11   0     0      0    0    0    0   0    0    0     0     0   0   0    0
##       TOP1 TUBB1
## PDO16    0     0
## PDO15    0     0
## PDO3     0     0
## PDO9     0     0
## PDO5     0     0
## PDO6     0     0
## PDO10    0     0
## PDO1     0     0
## PDO12    0     0
## PDO4     0     0
## PDO2     0     0
## PDO18    0     0
## PDO7     0     0
## PDO17    0     0
## PDO8     0     0
## PDO13    0     0
## PDO14    0     0
## PDO11    0     0

Could you repeat the analysis taking into account the ploidy as recommended by Cosmic?

  • Gain:

§ average genome ploidy <= 2.7 AND total copy number >= 5

§ OR average genome ploidy > 2.7 AND total copy number >= 9

  • Loss:

§ average genome ploidy <= 2.7 AND total copy number = 0

§ OR average genome ploidy > 2.7 AND total copy number < ( average genome ploidy - 2.7 )

Thanks!

Saving the global ploidy for organoids

##    PDO16    PDO15     PDO3     PDO9     PDO5     PDO6    PDO10     PDO1 
## 1.752252 2.706511 2.921865 2.825767 2.617572 2.639571 2.886007 3.659920 
##    PDO12     PDO4     PDO2    PDO18     PDO7    PDO17     PDO8    PDO13 
## 1.865682 2.818707 2.595914 2.884280 1.719599 2.864533 1.715556 2.753563 
##    PDO14    PDO11 
## 2.805061 3.042383

How come the segments include all the genome?

full_GR_example <- c(GRanges_chroms, as(rename_cols(data.frame(segtables_ICGC_absolute_copynumber_US[[1]])), "GRanges"))
granges_example <- GenomicRanges::disjoin(full_GR_example, with.revmap=TRUE, ignore.strand=TRUE)
diploid_segments = granges_example[sapply(granges_example$revmap, function(i) !any(i > length(GRanges_chroms))),]
diploid_segments
## GRanges object with 68 ranges and 1 metadata column:
##        seqnames              ranges strand |        revmap
##           <Rle>           <IRanges>  <Rle> | <IntegerList>
##    [1]        1     8799389-8863518      * |             1
##    [2]        1 121352501-128899999      * |             1
##    [3]        1 247249601-249250621      * |             1
##    [4]        2             1-20015      * |             2
##    [5]        2 242985494-243199373      * |             2
##    ...      ...                 ...    ... .           ...
##   [64]       20   62615607-63025520      * |            19
##   [65]       21          1-14413101      * |            22
##   [66]       21   46937501-48129895      * |            22
##   [67]       22          1-12199999      * |            21
##   [68]       22   49687501-51304566      * |            21
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
length(diploid_segments)
## [1] 68
chrlen
##       V1        V2
## 1   chr1 249250621
## 2   chr2 243199373
## 3   chr3 198022430
## 4   chr4 191154276
## 5   chr5 180915260
## 6   chr6 171115067
## 7   chr7 159138663
## 8   chrX 155270560
## 9   chr8 146364022
## 10  chr9 141213431
## 11 chr10 135534747
## 12 chr11 135006516
## 13 chr12 133851895
## 14 chr13 115169878
## 15 chr14 107349540
## 16 chr15 102531392
## 17 chr16  90354753
## 18 chr17  81195210
## 19 chr18  78077248
## 20 chr20  63025520
## 21  chrY  59373566
## 22 chr19  59128983
## 23 chr22  51304566
## 24 chr21  48129895
dim(natgen[[which_natgen]])
## [1] 692   7

Correlations between signatures and features

as.vector(number_of_segments$organoids[match(gsub('PDO', '', names(number_of_segments$organoids)), gsub('PDO ', '', rownames(org)))])
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# plot(as.vector(number_of_segments$organoids[match(gsub('PDO', '', names(number_of_segments$organoids)), gsub('PDO ', '', rownames(org)))]),
#      org[,'s6'])
# plot(as.vector(number_of_segments$organoids[match(gsub('PDO', '', names(number_of_segments$organoids)), gsub('PDO ', '', rownames(org)))]),
#      org[,'s6']/org[,'s1'])

Ascites

ascites = read.csv("data/asorg_PDO.csv")
ascites = cbind(ascites, bool_ascites=c('Ascites','Organoid'))
ascites = cbind(ascites, sample_paired=rep(1:(nrow(ascites)/2), each=2))
ascites$sample = factor(ascites$sample, levels = ascites$sample)
ggplot(melt(ascites, id.vars=c('sample', 'bool_ascites', 'sample_paired')), aes(x=sample, y=value, fill=variable))+geom_bar(stat = "identity")+facet_wrap(.~bool_ascites, scales = "free_x", ncol=2)+
  scale_fill_brewer(palette="Dark2")

ggsave("figures/barplot_ascites_organoids.pdf", width = 7.5, height = 5)
all_natgen[[which_natgen]][ grepl("^PDO*", rownames(all_natgen[[which_natgen]])) ,]
##               s1         s2         s3         s4         s5         s6
## PDO16 0.39513362 0.06729010 0.34479167 0.00000000 0.00000000 0.00000000
## PDO15 0.07805583 0.08880970 0.22121535 0.08653376 0.06256201 0.01868108
## PDO3  0.23476399 0.20895012 0.17222179 0.15510111 0.02522517 0.20373782
## PDO9  0.23100571 0.19063614 0.18273151 0.16669477 0.04712384 0.15403496
## PDO5  0.24499178 0.13713431 0.25230386 0.11515151 0.02113735 0.13505719
## PDO6  0.23223089 0.13487559 0.23766055 0.10535977 0.02303475 0.14393959
## PDO10 0.16775703 0.08768235 0.14241022 0.11117560 0.04764289 0.02913949
## PDO1  0.74902974 0.01129943 0.03920020 0.15464498 0.00000000 0.04582564
## PDO12 0.32278569 0.05252049 0.38942033 0.00000000 0.01843226 0.00000000
## PDO4  0.00000000 0.31244307 0.27977345 0.14349184 0.02552924 0.00000000
## PDO2  0.33184565 0.22108360 0.19553794 0.12216320 0.00000000 0.09765492
## PDO18 0.44037064 0.12106185 0.01282912 0.07522457 0.07223509 0.01572073
## PDO7  0.42352838 0.09658702 0.33313915 0.00000000 0.04548010 0.00000000
## PDO17 0.31266661 0.12548692 0.03681070 0.15643020 0.09878791 0.05463981
## PDO8  0.33693831 0.07875594 0.37523605 0.00000000 0.05862745 0.01255366
## PDO13 0.25153201 0.13408810 0.23818425 0.00000000 0.01721645 0.03610800
## PDO14 0.06503843 0.21519235 0.18641303 0.18349631 0.02652480 0.10931065
## PDO11 0.70270308 0.09838846 0.05374411 0.00000000 0.00000000 0.02081968
##               s7
## PDO16 0.19278461
## PDO15 0.44414227
## PDO3  0.00000000
## PDO9  0.02777306
## PDO5  0.09422399
## PDO6  0.12289886
## PDO10 0.41419243
## PDO1  0.00000000
## PDO12 0.21684123
## PDO4  0.23876240
## PDO2  0.03171468
## PDO18 0.26255802
## PDO7  0.10126534
## PDO17 0.21517784
## PDO8  0.13788860
## PDO13 0.32287119
## PDO14 0.21402443
## PDO11 0.12434467
library(Ternary)

amalgamation = cbind(rowSums(all_natgen[[which_natgen]][,-c(1,3)]),
      all_natgen[[which_natgen]][,1],
      all_natgen[[which_natgen]][,3])
colnames(amalgamation) = c('Rest', 's1', 's3')
amalgamation1 = amalgamation[grepl("^PDO*", rownames(all_natgen[[which_natgen]])),]
amalgamation2 = amalgamation[!grepl("^PDO*", rownames(all_natgen[[which_natgen]])),]

head(amalgamation1)
##            Rest         s1        s3
## PDO16 0.2600747 0.39513362 0.3447917
## PDO15 0.7007288 0.07805583 0.2212153
## PDO3  0.5930142 0.23476399 0.1722218
## PDO9  0.5862628 0.23100571 0.1827315
## PDO5  0.5027044 0.24499178 0.2523039
## PDO6  0.5301086 0.23223089 0.2376605
head(amalgamation2)
##                   Rest         s1         s3
## TCGA-04-1331 0.4856440 0.51435599 0.00000000
## TCGA-04-1332 0.6156329 0.38436714 0.00000000
## TCGA-04-1335 0.1226447 0.73776608 0.13958926
## TCGA-04-1336 0.4553610 0.54463905 0.00000000
## TCGA-04-1337 0.1548348 0.84516523 0.00000000
## TCGA-04-1338 0.9268825 0.02109527 0.05202221
nPoints <- 4000L
coordinates <- cbind(abs(rnorm(nPoints, 2, 3)),
                     abs(rnorm(nPoints, 1, 1.5)),
                     abs(rnorm(nPoints, 1, 0.5)))

TernaryPlot(atip = colnames(amalgamation)[1], btip = colnames(amalgamation)[2], ctip = colnames(amalgamation)[3],
            main='Organoids')
ColourTernary(TernaryDensity(amalgamation1, resolution = 10L))
TernaryPoints(amalgamation1, col = 'red', pch = '.')
TernaryDensityContour(amalgamation1, resolution = 30L)

TernaryPlot(atip = colnames(amalgamation)[1], btip = colnames(amalgamation)[2], ctip = colnames(amalgamation)[3],
            main='Primary cohorts')
ColourTernary(TernaryDensity(amalgamation2, resolution = 10L))
TernaryPoints(amalgamation2, col = 'red', pch = '.')
TernaryDensityContour(amalgamation2, resolution = 30L)

(number_of_segments$organoids)
## 
##  PDO1 PDO10 PDO11 PDO12 PDO13 PDO14 PDO15 PDO16 PDO17 PDO18  PDO2  PDO3  PDO4 
##    64   178    64   130   149   288   212   118   155   113   165   202   393 
##  PDO5  PDO6  PDO7  PDO8  PDO9 
##   176   174   120   137   205
all((organoids_absolute_copynumber@assayData$copynumber) == organoids_absolute_copynumber2@assayData$copynumber, na.rm = T)

require(reshape2)
require(ggplot2)
exposures = readRDS("~/Desktop/ascites_exposures.rds")
exposures
ggplot(melt(cbind(org=rownames(exposures), exposures)), aes(x=org, y=value, fill=variable))+geom_bar(stat='identity')

org == exposures

book_ascites = readxl::read_xlsx("/Users/morril01/Documents/PhD/other_repos/b_tape/Vias_Brenton/copy_number_analysis_organoids/data/Book1.xlsx")

exposures = exposures[match(book_ascites$`JBLAB-number`, rownames(exposures)),]
exposures = exposures[!colSums(apply(exposures, 1, is.na)) == 7,]
rownames(exposures) = book_ascites$organoid[match(book_ascites$`JBLAB-number`, rownames(exposures))]

## Data from organoids

rename_rows = function(i, new_names){
  rownames(i) = new_names; return(i)}


createBarplot(rename_rows(exposures, gsub('Sample ', 'PDO', rownames(exposures))), remove_labels = FALSE, order_labels = gsub('Sample ', 'PDO', names(sort(exposures[,1])))) + 
  scale_fill_brewer(palette="Dark2")+labs(y='Exposure')+
  ggtitle('Exposures for the organoids')+labs(x='')+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

gsub('org', '', rownames(exposures))
matched_ascites = ascites[match(gsub('org', '', rownames(exposures)), (ascites$sample)),2:ncol(ascites)]
all.equal(matched_ascites, exposures)
matched_ascites[1,] == exposures[1,]
matched_ascites[1,]
exposures[1,]